Abstract

This paper studies the problem of accurately recovering a sparse vector $\beta^*$ from
highly corrupted linear measurements $y = X\beta^* + e^* + w$, where $e^*$ is a sparse
error vector whose nonzero entries may be unbounded and $w$ is a bounded noise.
We propose a so-called extended Lasso optimization which takes into consideration
sparse prior information of both $\beta^*$ and $e^*$. Our first result shows that the
extended Lasso can faithfully recover both the regression and the corruption vectors.
Our analysis relies on a notion of extended restricted eigenvalue for the
design matrix $X$. Our second set of results applies to a general class of Gaussian
design matrices $X$ with i.i.d. rows $\mathcal{N}(0, \Sigma)$, for which we establish a surprising
phenomenon: the extended Lasso can recover the exact signed supports of both $\beta^*$
and $e^*$ from only $\Omega(k \log p \log n)$ observations, even when the fraction of corruption is
arbitrarily close to one. Our analysis also shows that this number of observations
required to achieve exact signed support recovery is optimal.


1  Introduction 

One of the central problems in statistics is linear regression, in which the goal is to accurately
estimate a regression vector $\beta^* \in \mathbb{R}^p$ from the noisy observations

$$y = X\beta^* + w, \qquad (1)$$

where $X \in \mathbb{R}^{n \times p}$ is the measurement or design matrix and $w \in \mathbb{R}^n$ is the stochastic
observation noise vector. A setting that has recently attracted much attention from the research
community is the model in which the number of regression variables $p$ is larger than the number
of observations $n$ ($p > n$). In such circumstances, without imposing additional assumptions
on the model, it is well known that the problem is ill-posed, and thus linear regression is not
consistent. Accordingly, there have been various lines of work on high-dimensional inference based
on imposing different types of structural constraints such as sparsity and group sparsity [15] [5] [21].
Among them, the most popular model focuses on the sparsity assumption on the regression vector. To
estimate $\beta^*$, a standard method, namely the Lasso [15], was proposed, which uses the $\ell_1$-penalty
as a surrogate function to enforce the sparsity constraint:

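$$\min_{\beta} \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad (2)$$

where $\lambda$ is the positive regularization parameter and the $\ell_1$-norm $\|\beta\|_1$ is defined by $\|\beta\|_1 = \sum_{i=1}^{p} |\beta_i|$.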
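
As a small illustration of how the $\ell_1$ penalty produces exact zeros, the sketch below relies on a standard fact that is not part of this paper: when the columns of $X$ are orthonormal, the Lasso minimizer is obtained by soft-thresholding $X^T y$ at the level of the regularization parameter. The function names are hypothetical, and the code uses plain Python lists rather than a numerical library.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values with |v| <= t become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_orthonormal(X, y, lam):
    """Minimizer of 0.5*||y - X b||_2^2 + lam*||b||_1, ASSUMING X^T X = I."""
    n, p = len(X), len(X[0])
    # Correlation of each column of X with y, i.e. the vector X^T y.
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    return [soft_threshold(c, lam) for c in xty]


# Identity design: coordinates of y with magnitude below lam are set to zero.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(lasso_orthonormal(X, [3.0, -0.5, 1.5], lam=1.0))
```

For a general design matrix no closed form is available and an iterative solver is needed; the orthonormal case is shown only because it makes the thresholding effect visible.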

During the past few years, there have been numerous studies aimed at understanding $\ell_1$-regularization for
sparse regression models [23] [11] [10] [17] [4] [2] [22]. These works are mainly characterized by
the type of loss function considered. For instance, some authors [4] seek a regression
estimate $\hat\beta$ that delivers small prediction error, while other authors [2] [11] [22] seek a
regressor with minimal parameter estimation error, measured by the $\ell_2$-norm of $(\hat\beta - \beta^*)$.
Another line of work [23] [17] considers variable selection, in which the goal is to obtain an
estimate that correctly identifies the support of the true regression vector. To achieve low prediction
or parameter estimation loss, it is now well known that it is both sufficient and necessary to impose
certain lower bounds on the smallest singular values of the design matrix [10] [2], while a notion of
small mutual incoherence for the design matrix [4] [23] [17] is required to achieve accurate variable
selection.

We notice that all the previous work relies on the assumption that the observation noise has bounded
energy. Without this assumption, it is very likely that the estimated regressor is either unreliable
or unable to identify the correct support. With this observation in mind, in this paper we extend the
linear model (1) by considering noise with unbounded energy. It is clear that if all the entries
of $y$ are corrupted by large errors, then it is impossible to faithfully recover the regression vector $\beta^*$.
However, in many practical applications such as face and acoustic recognition, only a portion of the
observation vector is contaminated by gross errors. Formally, we have the mathematical model

$$y = X\beta^* + e^* + w, \qquad (3)$$

where $e^* \in \mathbb{R}^n$ is the sparse error whose nonzero entries have unknown locations and
arbitrarily large magnitudes, and $w$ is another noise vector with bounded entries. In this paper, we assume
that $w$ has a multivariate Gaussian $\mathcal{N}(0, \sigma^2 I_{n \times n})$ distribution. This model also includes as a
particular case the missing data problem, in which not all the entries of $y$ are observed and some
are missing. This problem is particularly important in computer vision and biology applications.
If some entries of $y$ are missing, the nonzero entries of $e^*$ at the locations of the
missing entries have the same values as the corresponding entries of $y$ but with opposite
signs.
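
To make the missing-data reduction concrete, here is a minimal sketch in plain Python. It assumes a convention that the text does not state explicitly, namely that a missing entry is recorded as zero; the names `y_full`, `missing`, and `y_obs` are hypothetical. Under that convention the recorded vector equals the complete one plus a sparse error whose nonzero entries are the negatives of the missing values.

```python
# Complete observations (what would have been recorded with no missing data).
y_full = [2.0, -1.5, 0.7, 3.2, -0.4]
missing = {1, 3}  # indices of the entries that were not recorded

# ASSUMPTION: a missing entry is stored as 0 in the recorded vector.
y_obs = [0.0 if i in missing else v for i, v in enumerate(y_full)]

# Sparse error that explains the recorded vector: y_obs = y_full + e_star.
e_star = [y_obs[i] - y_full[i] for i in range(len(y_full))]

# The support of e_star coincides with the set of missing entries.
support = [i for i, v in enumerate(e_star) if v != 0.0]
print(support)
print(e_star)
```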

The problem of recovering data under gross errors has gained increasing attention recently, with
many interesting practical applications [18] [6] [7] as well as theoretical considerations [9] [13] [8].
Another recent line of research on recovering data from grossly corrupted measurements has
been pursued in the context of robust principal component analysis (RPCA) [3] [20] [1]. Let us
consider some examples to illustrate:

•  Face recognition. The model (3) was originally proposed by Wright et al. [19] in
the context of face recognition. In this problem, a face test sample $y$ is assumed to be
represented as a linear combination of training faces in the dictionary $X$, $y = X\beta$, where $\beta$
is the coefficient vector used for classification. However, it is often the case that the face is
occluded by unwanted objects such as glasses, hats, etc. These occlusions, which occupy a
portion of the test face, can be considered as the sparse error $e^*$ in the model (3).

•  Subspace clustering. One of the important problems in high-dimensional analysis is to
cluster data points into multiple subspaces. A recent work of Elhamifar and Vidal [6]
showed that this problem can be solved by expressing each data point as a sparse linear
combination of all other data points. The coefficient vectors recovered from solving the Lasso
problems are then employed for clustering. If the data points are represented as a matrix $X$,
then we wish to find a sparse coefficient matrix $B$ such that $X = XB$ and $\operatorname{diag}(B) = 0$.
When the data is missing or contaminated with outliers, [6] formulates the problem as
$X = XB + E$ and minimizes a sum of two $\ell_1$-norms with respect to both $B$ and $E$.

•  Sensor network. In this model, sensors collect measurements of a signal $\beta^*$ independently
by simply projecting $\beta^*$ onto the row vectors of a sensing matrix $X$, $y_i = \langle X_i, \beta^* \rangle$. The
measurements $y_i$ are then sent to the central hub for analysis. However, it is highly likely
that some sensors fail to send their measurements correctly and sometimes report
totally irrelevant measurements. Therefore, it is more accurate to employ the observation
model (3) than the model (1).

It is worth noting that in the aforementioned applications, $e^*$ plays the role of the sparse (undesired)
error. However, in many other applications, $e^*$ can contain meaningful information and thus
needs to be recovered. An example of this kind is signal separation, in which $\beta^*$ and $e^*$ are two
distinct signal components (video or audio). Furthermore, in applications such as classification and
clustering, the assumption that the test sample $y$ is a linear combination of a few training samples in
the dictionary (design matrix) $X$ might be violated. The sparse component $e^*$ can thus be seen as
compensation for the mismatch of the linear regression model.

Given the observation model (3) and the sparsity assumptions on both the regression vector $\beta^*$ and the error
$e^*$, we propose the following convex minimization to estimate the unknown parameter $\beta^*$ as well as
the error $e^*$:

$$\min_{\beta, e} \; \frac{1}{2}\|y - X\beta - e\|_2^2 + \lambda_\beta\|\beta\|_1 + \lambda_e\|e\|_1, \qquad (4)$$

where $\lambda_\beta$ and $\lambda_e$ are positive regularization parameters. This optimization, which we call the extended Lasso,
can be seen as a generalization of the Lasso program. Indeed, by setting $\lambda_e = 0$, (4) returns to
the standard Lasso. The additional regularization associated with $e$ encourages sparsity of the error,
where the parameter $\lambda_e$ controls the sparsity level. In this paper, we focus on the following questions:
what are necessary and sufficient conditions on the ambient dimension $p$, the number of observations
$n$, the sparsity index $k$ of the regression vector $\beta^*$, and the fraction of corruption so that (i) the extended
Lasso is able (or unable) to recover the exact support sets of both $\beta^*$ and $e^*$, and (ii) the extended Lasso
is able to recover $\beta^*$ and $e^*$ with small prediction error and parameter error? We are particularly
interested in understanding the asymptotic situation where the fraction of errors is arbitrarily close
to 100%.
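
The text does not prescribe an algorithm for solving the extended Lasso. One simple option, sketched below as an assumption rather than the authors' method, is proximal gradient descent (ISTA) applied jointly to the pair $(\beta, e)$: a gradient step on the quadratic term followed by soft-thresholding, with threshold proportional to $\lambda_\beta$ for $\beta$ and to $\lambda_e$ for $e$. The function names, the iteration count, and the toy data are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def soft(v, t):
    """Soft-thresholding: the proximal operator of t * |.|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def extended_lasso(X, y, lam_beta, lam_e, iters=5000):
    """Proximal-gradient sketch for
    min 0.5*||y - X b - e||_2^2 + lam_beta*||b||_1 + lam_e*||e||_1."""
    n, p = len(X), len(X[0])
    # Step size 1/L, where L = ||X||_F^2 + 1 upper-bounds the largest
    # eigenvalue of [X, I]^T [X, I].
    step = 1.0 / (sum(x * x for row in X for x in row) + 1.0)
    beta, e = [0.0] * p, [0.0] * n
    for _ in range(iters):
        # Residual r = y - X beta - e at the current iterate.
        r = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) - e[i]
             for i in range(n)]
        # Gradient step on the quadratic term, then threshold each block.
        beta = [soft(beta[j] + step * sum(X[i][j] * r[i] for i in range(n)),
                     step * lam_beta) for j in range(p)]
        e = [soft(e[i] + step * r[i], step * lam_e) for i in range(n)]
    return beta, e


# Toy data: one feature, four observations, the last one grossly corrupted.
X = [[1.0], [1.0], [1.0], [1.0]]
y = [1.0, 1.0, 1.0, 10.0]
beta_hat, e_hat = extended_lasso(X, y, lam_beta=0.1, lam_e=0.5)
print(beta_hat, e_hat)
```

In this toy problem the large discrepancy in the last observation is absorbed by the error estimate, while the error estimate at the three clean observations is exactly zero.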

Previous work. The problem of recovering the estimation vector $\beta^*$ and the error $e^*$ was originally
proposed and analyzed by Wright and Ma [18]. In the absence of the stochastic noise $w$ in the
observation model (3), the authors proposed to estimate $(\beta^*, e^*)$ by solving the linear program

$$\min_{\beta, e} \; \|\beta\|_1 + \|e\|_1 \quad \text{s.t.} \quad y = X\beta + e. \qquad (5)$$

The result of [18] is asymptotic in nature. They showed that for a class of Gaussian design matrices
with i.i.d. entries, the optimization (5) can recover $(\beta^*, e^*)$ precisely with high probability even when
the fraction of corruption is arbitrarily close to one. However, the result holds under rather stringent
conditions. In particular, they require the number of observations $n$ to grow proportionally with the
ambient dimension $p$, and the sparsity index $k$ to be a very small fraction of $n$. These conditions are
of course far from the optimal bound in the compressed sensing (CS) and statistics literature (recall that
$k \le O(n/\log p)$ is sufficient in conventional analysis [17]).

Another line of work has also focused on the optimization (5). In the papers of Laska et al. [7] and
Li et al. [9], the authors establish that for a Gaussian design matrix $X$, if $n \ge C(k + s)\log p$, where $s$
is the sparsity level of $e^*$, then the recovery is exact. This follows from the fact that the combined
matrix $[X, I]$ obeys the restricted isometry property, a well-known property used to guarantee exact
recovery of sparse vectors via $\ell_1$-minimization. These results, however, do not allow the fraction of
corruption to be close to one.

Among the previous work, the most closely related to the current paper are the recent results by Li [8]
and Nguyen et al. [13], in which a positive regularization parameter $\lambda$ is employed to control the
sparsity of $e^*$. Using different methods, both sets of authors show that when $\lambda$ is deterministically
selected to be $1/\sqrt{\log p}$ and $X$ is a sub-orthogonal matrix, the solution of the following optimization
is exact even when a constant fraction of the observations is corrupted. Moreover, [8] establishes a similar
result with a Gaussian design matrix in which the number of observations is only of order $k \log p$,
an amount that is known to be optimal in CS and statistics.

$$\min_{\beta, e} \; \|\beta\|_1 + \lambda\|e\|_1 \quad \text{s.t.} \quad y = X\beta + e. \qquad (6)$$

Our contribution. This paper considers a general setting in which the observations are contaminated
by both sparse and dense errors. We allow the number of corruptions to grow linearly with the number of
observations and the corruptions to have arbitrarily large magnitudes. We establish a general scaling of the
quadruplet $(n, p, k, s)$ such that the extended Lasso stably recovers both the regression and corruption vectors.
Of particular interest to us are the following questions:

(a)  First, under what scalings of $(n, p, k, s)$ does the extended Lasso obtain a unique solution
with small estimation error?

(b)  Second, under what scalings of $(n, p, k)$ does the extended Lasso achieve exact signed
support recovery even when almost all the observations are corrupted?

(c)  Third, under what scalings of $(n, p, k, s)$ does no solution of the extended Lasso specify
the correct signed support?

To answer the first question, we introduce a notion of extended restricted eigenvalue for the matrix
$[X, I]$, where $I$ is an identity matrix. We show that this property is satisfied by a general class of
random Gaussian design matrices. The answers to the last two questions require stricter conditions
on the design matrix. In particular, for random Gaussian design matrices with i.i.d. rows $\mathcal{N}(0, \Sigma)$, we
rely on two standard assumptions: invertibility and mutual incoherence.

If we denote $Z = [X, I]$, where $I$ is an identity matrix, and $\bar\beta = [\beta^{*T}, e^{*T}]^T$, then the observation
vector $y$ can be reformulated as $y = Z\bar\beta + w$, which has the same form as the standard Lasso model. However,
previous results [2] [17] for random Gaussian design matrices do not apply to this setting
since $Z$ no longer behaves like a Gaussian matrix. To establish a theoretical analysis, we need
to study the interaction between the Gaussian and identity matrices more closely. By exploiting the fact
that the matrix $Z$ consists of two components, one of which has a special structure, our analysis
reveals an interesting phenomenon: the extended Lasso can accurately recover both the regressor $\beta^*$ and the
corruption $e^*$ even when the fraction of corruption is up to 100%. We measure the recoverability of
these variables under two criteria: parameter accuracy and feature selection accuracy. Moreover,
our analysis can be extended to the situation in which the identity matrix is replaced by a tight
frame $D$, as well as to other models such as the group Lasso or the matrix Lasso with sparse error.
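
The stacked formulation can be taken one step further. As a side remark that is not part of the paper's analysis, rescaling the error variable turns the extended Lasso (4) into a standard Lasso with a single regularization parameter. The symbols $\tilde e$, $Z_\lambda$, and $b$ are notation introduced only for this remark, and $\lambda_n := \lambda_e/\lambda_\beta$:

```latex
\tilde e := \lambda_n e
\quad\Longrightarrow\quad
X\beta + e = \underbrace{[\,X,\ \lambda_n^{-1} I\,]}_{Z_\lambda}
\begin{bmatrix}\beta \\ \tilde e\end{bmatrix},
\qquad
\lambda_\beta\|\beta\|_1 + \lambda_e\|e\|_1
  = \lambda_\beta\bigl(\|\beta\|_1 + \|\tilde e\|_1\bigr),

\min_{\beta, e}\ \tfrac12\|y - X\beta - e\|_2^2
  + \lambda_\beta\|\beta\|_1 + \lambda_e\|e\|_1
\;=\;
\min_{b}\ \tfrac12\|y - Z_\lambda b\|_2^2 + \lambda_\beta\|b\|_1,
\qquad b = \begin{bmatrix}\beta \\ \tilde e\end{bmatrix}.
```

The rescaled design $Z_\lambda$ is still the concatenation of a Gaussian block and a multiple of the identity, so it does not behave like a Gaussian matrix either.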

Notation. We summarize here some standard notation used throughout the paper. We reserve $T$
and $S$ for the supports of $\beta^*$ and $e^*$, respectively. Given a design matrix $X \in \mathbb{R}^{n \times p}$ and
subsets $S$ and $T$, we use $X_{ST}$ to denote the $|S| \times |T|$ submatrix obtained by extracting the rows
indexed by $S$ and the columns indexed by $T$. We use the notation $C_1, C_2, c_1, c_2$, etc., to refer to positive
constants whose values may change from line to line. Given two functions $f$ and $g$, the notation
$f(n) = O(g(n))$ means that there exists a constant $c < +\infty$ such that $f(n) \le c\,g(n)$; the notation
$f(n) = \Omega(g(n))$ means that $f(n) \ge c\,g(n)$; and the notation $f(n) = \Theta(g(n))$ means that $f(n) =
O(g(n))$ and $f(n) = \Omega(g(n))$. The symbol $f(n) = o(g(n))$ means that $f(n)/g(n) \to 0$.

2  Main  results 

In this section, we provide precise statements of the main results of this paper. In the first subsection,
we address parameter estimation and provide a deterministic result based on the
notion of extended restricted eigenvalue. We further show that the random Gaussian design matrix
satisfies this property with high probability. The next subsection considers feature selection.
We establish conditions on the design matrix such that the solution of the extended Lasso has the
exact signed supports.

2.1  Parameter  estimation 

As in the conventional Lasso, to obtain a low parameter estimation bound, it is necessary to impose
conditions on the design matrix $X$. In this paper, we introduce the notion of an extended restricted
eigenvalue (extended RE) condition. Let $\mathcal{C}$ be a restricted set. We say that the matrix $X$ satisfies the
extended RE assumption over the set $\mathcal{C}$ if there exists some $\kappa_l > 0$ such that

$$\|Xh + f\|_2 \ge \kappa_l(\|h\|_2 + \|f\|_2) \quad \text{for all } (h, f) \in \mathcal{C}, \qquad (7)$$

where the restricted set $\mathcal{C}$ of interest is defined, with $\lambda_n := \lambda_e/\lambda_\beta$, as follows:

$$\mathcal{C} := \{(h, f) \in \mathbb{R}^p \times \mathbb{R}^n \mid \|h_{T^c}\|_1 + \lambda_n\|f_{S^c}\|_1 \le 3\|h_T\|_1 + 3\lambda_n\|f_S\|_1\}. \qquad (8)$$

This assumption is a natural extension of the restricted eigenvalue condition and the restricted strong
convexity considered in [2] [14] and [12]. In the absence of the vector $f$ in equation (7) and in the
set $\mathcal{C}$, this condition reduces to the restricted eigenvalue defined in [2]. As explained at more length
in [2] and [16], the restricted eigenvalue is among the weakest assumptions on the design matrix under
which the solution of the Lasso is consistent.
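
The restricted set contains the pairs $(h, f)$ whose $\ell_1$ mass off the supports $T$ and $S$ is at most three times their mass on the supports, with the $f$ part weighted by $\lambda_n$. A minimal membership check, written as an illustrative sketch with hypothetical function names and zero-based index sets, could look as follows.

```python
def l1(v, idx):
    """l1-norm of v restricted to the index collection idx."""
    return sum(abs(v[i]) for i in idx)


def in_restricted_set(h, f, T, S, lam_n):
    """Check ||h_Tc||_1 + lam_n*||f_Sc||_1 <= 3*||h_T||_1 + 3*lam_n*||f_S||_1."""
    Tc = [i for i in range(len(h)) if i not in T]
    Sc = [i for i in range(len(f)) if i not in S]
    lhs = l1(h, Tc) + lam_n * l1(f, Sc)
    rhs = 3.0 * l1(h, T) + 3.0 * lam_n * l1(f, S)
    return lhs <= rhs


T, S = {0}, {1}
# Mass concentrated on the supports: the pair belongs to the set.
print(in_restricted_set([1.0, 0.5, 0.5], [0.0, 2.0], T, S, lam_n=0.5))
# Mass concentrated off the supports: the pair does not belong to the set.
print(in_restricted_set([0.1, 2.0, 2.0], [1.0, 0.1], T, S, lam_n=0.5))
```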

With this assumption at hand, we now state the first theorem.

Theorem 1. Consider the optimal solution $(\hat\beta, \hat e)$ to the optimization problem (4) with regularization
parameters chosen as

$$\lambda_\beta \ge \frac{2}{\gamma}\|X^* w\|_\infty \quad \text{and} \quad \lambda_n := \frac{\lambda_e}{\lambda_\beta} = \gamma\,\frac{\|w\|_\infty}{\|X^* w\|_\infty}, \qquad (9)$$

where $\gamma \in (0, 1]$. Assuming that the design matrix $X$ obeys the extended RE, the error pair
$(h, f) = (\hat\beta - \beta^*, \hat e - e^*)$ is bounded by

$$\|h\|_2 + \|f\|_2 \le 3\kappa_l^{-2}\left(\lambda_\beta\sqrt{k} + \lambda_e\sqrt{s}\right). \qquad (10)$$

There are several interesting observations from this theorem:

1) The error bound naturally splits into two components related to the sparsity indices of $\beta^*$ and $e^*$.
In addition, the error bound contains three quantities: the sparsity indices, the regularization parameters,
and the extended RE constant. If the terms related to the corruption $e^*$ are omitted, then we obtain
a parameter estimation bound similar to that of the standard Lasso [2] [12].

2) The choice of the regularization parameters $\lambda_\beta$ and $\lambda_e$ can be made explicit: assuming $w$ is a Gaussian
random vector whose entries are $\mathcal{N}(0, \sigma^2)$ and the design matrix has unit-norm columns,
with high probability $\|X^* w\|_\infty \le 2\sqrt{\sigma^2 \log p}$ and $\|w\|_\infty \le 2\sqrt{\sigma^2 \log n}$. Thus, it is sufficient
to select $\lambda_\beta \ge \frac{4}{\gamma}\sqrt{\sigma^2 \log p}$ and $\lambda_e \ge 4\sqrt{\sigma^2 \log n}$.

3) At first glance, the parameter $\gamma$ does not seem to have any meaningful interpretation, and
$\gamma = 1$ seems to be the best selection because it yields the smallest estimation error. However,
this parameter actually controls the sparsity level of the regression vector with respect to the fraction
of corruption. This relation is made via the restricted set $\mathcal{C}$.


In the following lemma, we show that the extended RE condition holds for a large class of
random Gaussian design matrices whose rows are i.i.d. zero mean with covariance $\Sigma$. Before stating the
lemma, let us define some quantities operating on the covariance matrix $\Sigma$: $C_{\min} := \lambda_{\min}(\Sigma)$ is the
smallest eigenvalue of $\Sigma$, $C_{\max} := \lambda_{\max}(\Sigma)$ is the largest eigenvalue of $\Sigma$, and $\zeta(\Sigma) := \max_i \Sigma_{ii}$
is the maximal entry on the diagonal of the matrix $\Sigma$.


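Lemma 1. Consider the random Gaussian design matrix whose rows are i.i.d. $\mathcal{N}(0, \Sigma)$ and assume
$n^2 C_{\max}\zeta(\Sigma) = O(1)$. Select

$$\lambda_n := \frac{\gamma}{\sqrt{\zeta(\Sigma)}}\sqrt{\frac{\log n}{n \log p}}, \qquad (11)$$

then with probability greater than $1 - c_1\exp(-c_2 n)$, the matrix $X$ satisfies the extended RE with
parameter $\kappa_l = \frac{1}{4\sqrt{2}}$, provided that $n \ge C\frac{\zeta(\Sigma)}{C_{\min}} k \log p$ and $s \le \min\{C_1\frac{n}{\gamma^2 \log n}, C_2 n\}$ for some
small constants $C_1, C_2$.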


We would like to make some remarks:

1) The choice of the parameter $\lambda_n$ is nothing special here. When the design matrix is Gaussian and
independent of the Gaussian stochastic noise $w$, we can easily show that $\|X^* w\|_\infty \le 2\sqrt{\zeta(\Sigma)\,n\sigma^2 \log p}$
with probability at least $1 - 2\exp(-\log p)$. Therefore, the selection of $\lambda_n$ follows from Theorem 1.

2) The proof of this lemma, shown in the Appendix, boils down to controlling two terms:

•  Restricted eigenvalue with $X$.

$$\|Xh\|_2^2 + \|f\|_2^2 \ge \kappa_r(\|h\|_2 + \|f\|_2)^2 \quad \text{for all } (h, f) \in \mathcal{C}.$$

•  Mutual incoherence. The column space of the matrix $X$ is incoherent with the column space
of the identity matrix. That is, there exists some $\kappa_m > 0$ such that

$$|\langle Xh, f\rangle| \le \kappa_m(\|h\|_2 + \|f\|_2)^2 \quad \text{for all } (h, f) \in \mathcal{C}.$$

If the incoherence between these two column spaces is sufficiently small such that $2\kappa_m < \kappa_r$, then
we can conclude that $\|Xh + f\|_2^2 \ge (\kappa_r - 2\kappa_m)(\|h\|_2 + \|f\|_2)^2$. The small mutual incoherence
property is especially important since it quantifies how well the regression component separates from the sparse
error.
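
The conclusion of this remark follows from expanding the square and applying the two conditions in turn. Spelled out as a short worked derivation, valid for every $(h, f)$ in the restricted set:

```latex
\begin{aligned}
\|Xh + f\|_2^2
  &= \|Xh\|_2^2 + \|f\|_2^2 + 2\langle Xh, f\rangle \\
  &\ge \|Xh\|_2^2 + \|f\|_2^2 - 2\,|\langle Xh, f\rangle| \\
  &\ge \kappa_r\bigl(\|h\|_2 + \|f\|_2\bigr)^2
       - 2\kappa_m\bigl(\|h\|_2 + \|f\|_2\bigr)^2 \\
  &= (\kappa_r - 2\kappa_m)\bigl(\|h\|_2 + \|f\|_2\bigr)^2 .
\end{aligned}
```

Taking square roots then gives an inequality of the extended RE form with constant $\sqrt{\kappa_r - 2\kappa_m}$, which is positive exactly when $2\kappa_m < \kappa_r$.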


3) To simplify our result, we consider the special case of the uniform Gaussian design, in which
$\Sigma = \frac{1}{n} I_{p \times p}$. In this situation, $C_{\min} = C_{\max} = \zeta(\Sigma) = 1/n$. We have the following result, which is
a corollary of Theorem 1 and Lemma 1.

Corollary 1 (Standard Gaussian design). Let $X$ be a standard Gaussian design matrix. Consider
the optimal solution $(\hat\beta, \hat e)$ to the optimization problem (4) with regularization parameters chosen as

$$\lambda_\beta \ge \frac{4}{\gamma}\sqrt{\sigma^2 \log p} \quad \text{and} \quad \lambda_e \ge 4\sqrt{\sigma^2 \log n}, \qquad (12)$$

where $\gamma \in (0, 1]$. Also assume that $n \ge Ck \log p$ and $s \le \min\{C_1\frac{n}{\gamma^2 \log n}, C_2 n\}$ for some small
constants $C_1, C_2$. Then with probability greater than $1 - c_1\exp(-c_2 n)$, the error pair $(h, f) =
(\hat\beta - \beta^*, \hat e - e^*)$ is bounded by

$$\|h\|_2 + \|f\|_2 \le 384\left(\frac{1}{\gamma}\sqrt{\sigma^2 k \log p} + \sqrt{\sigma^2 s \log n}\right). \qquad (13)$$

Corollary 1 reveals an interesting phenomenon: by setting $\gamma = 1/\sqrt{\log n}$, even when the number
of corrupted observations is linearly proportional to the number of samples $n$, the extended Lasso (4) is still
capable of recovering both the coefficient vector $\beta^*$ and the corruption (missing data) vector $e^*$ within the bounded
error (13). Without the dense noise $w$ in the observation model (3) ($\sigma = 0$), the extended Lasso
recovers the exact solution. This result is impossible to achieve with the standard Lasso. Furthermore, if
we know a priori that the number of corrupted observations is of order $O(n/\log p)$, then selecting
$\gamma = 1$ instead of $1/\sqrt{\log n}$ will minimize the estimation error (see equation (13) of Corollary 1).
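
To see the trade-off controlled by $\gamma$ numerically, the sketch below evaluates the smallest regularization parameters allowed by Corollary 1, taken as $\lambda_\beta = \frac{4}{\gamma}\sqrt{\sigma^2 \log p}$ and $\lambda_e = 4\sqrt{\sigma^2 \log n}$, together with the bound $384\left(\frac{1}{\gamma}\sqrt{\sigma^2 k \log p} + \sqrt{\sigma^2 s \log n}\right)$. The function names and the numerical values of $(n, p, k, s, \sigma)$ are hypothetical and chosen only for illustration.

```python
import math


def corollary1_parameters(sigma, n, p, gamma):
    """Smallest (lam_beta, lam_e) allowed by the conditions of Corollary 1."""
    lam_beta = (4.0 / gamma) * math.sqrt(sigma ** 2 * math.log(p))
    lam_e = 4.0 * math.sqrt(sigma ** 2 * math.log(n))
    return lam_beta, lam_e


def corollary1_bound(sigma, n, p, k, s, gamma):
    """Right-hand side of the error bound of Corollary 1."""
    return 384.0 * (math.sqrt(sigma ** 2 * k * math.log(p)) / gamma
                    + math.sqrt(sigma ** 2 * s * math.log(n)))


# Hypothetical problem sizes, used only to compare the two choices of gamma.
n, p, k, s, sigma = 1000, 2000, 10, 100, 0.1
for gamma in (1.0, 1.0 / math.sqrt(math.log(n))):
    print(gamma,
          corollary1_parameters(sigma, n, p, gamma),
          corollary1_bound(sigma, n, p, k, s, gamma))
```

The smaller value of $\gamma$ tolerates more corrupted observations but inflates the first term of the bound by a factor of $\sqrt{\log n}$, which is the trade-off described in the text.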


2.2  Feature  selection  with  random  Gaussian  design 

In many applications, the feature selection criterion is preferred [17] [23]. Feature selection
refers to the property that the recovered parameter has the same signed support as the true regressor.
In general, good feature selection implies good parameter estimation, but the reverse direction does
not usually hold. In this part, we investigate conditions on the design matrix and the scaling of
$(n, p, k, s)$ such that both the regression and the sparse error vectors meet this criterion.

Consider the linear model (3), where $X$ is a Gaussian random design matrix whose rows are i.i.d.
zero mean with covariance matrix $\Sigma$. It is well known for the Lasso that, in order to obtain
feature selection accuracy, the covariance matrix $\Sigma$ must obey two properties: invertibility and small
mutual incoherence restricted to the set $T$. The first property guarantees that (4) is strictly convex,
leading to a unique solution of the convex program, while the second property requires that the
coherence between the two components of $\Sigma$, one related to the set $T$ and the other to the set $T^c$, be
sufficiently small.

1.  Invertibility. To guarantee uniqueness, we require $\Sigma_{TT}$ to be invertible. In particular, letting
$C_{\min} = \lambda_{\min}(\Sigma_{TT})$, we require $C_{\min} > 0$.

2.  Mutual incoherence. For some $\gamma \in (0, 1)$,

$$\|\Sigma_{T^c T}(\Sigma_{TT})^{-1}\|_\infty \le \frac{1}{2}(1 - \gamma), \qquad (14)$$

where $\|\cdot\|_\infty$ refers to the $\ell_\infty/\ell_\infty$ operator norm. It is worth noting that in the standard Lasso
the factor $\frac{1}{2}$ is omitted. Our condition is tighter than the condition used to establish feature
selection for the Lasso by a constant factor. In fact, the quantity $1/2$ is nothing special
here, and we can set it to any value close to one at the cost of an increase in the number of samples
$n$. Thus, we use $1/2$ for simplicity of the proof.
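
The two assumptions can be checked numerically for a given covariance matrix and support. The sketch below is an illustration with hypothetical function names: it computes the $\ell_\infty/\ell_\infty$ operator norm (maximum absolute row sum) of $\Sigma_{T^c T}(\Sigma_{TT})^{-1}$ with a small Gaussian-elimination solver in plain Python and compares it with $\frac{1}{2}(1 - \gamma)$. It assumes $\Sigma$ is symmetric and $\Sigma_{TT}$ is invertible, and it uses zero-based indices.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            factor = M[r][c] / M[c][c]
            for j in range(c, m + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * m
    for r in reversed(range(m)):
        x[r] = (M[r][m] - sum(M[r][j] * x[j] for j in range(r + 1, m))) / M[r][r]
    return x


def mutual_incoherence(Sigma, T):
    """Max absolute row sum of Sigma_{Tc,T} Sigma_{T,T}^{-1}."""
    Tc = [i for i in range(len(Sigma)) if i not in T]
    S_TT = [[Sigma[i][j] for j in T] for i in T]
    worst = 0.0
    for i in Tc:
        # By symmetry of Sigma_TT, row i of the product solves
        # Sigma_TT x = Sigma_{T,i}.
        row = solve(S_TT, [Sigma[t][i] for t in T])
        worst = max(worst, sum(abs(v) for v in row))
    return worst


def satisfies_incoherence(Sigma, T, gamma):
    """Check the mutual incoherence condition with the factor 1/2."""
    return mutual_incoherence(Sigma, T) <= 0.5 * (1.0 - gamma)


Sigma = [[1.0, 0.2, 0.1],
         [0.2, 1.0, 0.3],
         [0.1, 0.3, 1.0]]
print(mutual_incoherence(Sigma, [0, 1]))
print(satisfies_incoherence(Sigma, [0, 1], gamma=0.2))
```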

Toward the end, we will also use three other quantities defined on the restricted covariance
matrix $\Sigma_{TT}$: $C_{\max}$, which is defined as the maximum eigenvalue of $\Sigma_{TT}$, $C_{\max} :=
\lambda_{\max}(\Sigma_{TT})$; and $D^-_{\max}$ and $D^+_{\max}$, which are the $\ell_\infty$-norms of the matrices $(\Sigma_{TT})^{-1}$ and $\Sigma_{TT}$:

$$D^-_{\max} := \|(\Sigma_{TT})^{-1}\|_\infty \quad \text{and} \quad D^+_{\max} := \|\Sigma_{TT}\|_\infty.$$

Our result also involves two other quantities defined on the conditional covariance matrix of
$(X_{T^c} \mid X_T)$, defined as

$$\Sigma_{T^c|T} := \Sigma_{T^c T^c} - \Sigma_{T^c T}(\Sigma_{TT})^{-1}\Sigma_{T T^c}. \qquad (15)$$

We then define $\rho_u(\Sigma_{T^c|T}) = \max_i (\Sigma_{T^c|T})_{ii}$ and $\rho_l(\Sigma_{T^c|T}) = \frac{1}{2}\min_{i \ne j}[(\Sigma_{T^c|T})_{ii} + (\Sigma_{T^c|T})_{jj} -
2(\Sigma_{T^c|T})_{ij}]$. Toward the end, we use the shorthand $\rho_u$ and $\rho_l$.
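
As a small worked example of these quantities (the matrix below is hypothetical and chosen only for illustration), take $p = 3$, an equicorrelated covariance with correlation $0.2$, and $T = \{1\}$. The conditional covariance is the Schur complement of the block indexed by $T$:

```latex
\Sigma = \begin{bmatrix} 1 & 0.2 & 0.2 \\ 0.2 & 1 & 0.2 \\ 0.2 & 0.2 & 1 \end{bmatrix},
\qquad T = \{1\}, \quad T^c = \{2, 3\},

\Sigma_{T^c T}(\Sigma_{TT})^{-1} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}
\;\Longrightarrow\;
\|\Sigma_{T^c T}(\Sigma_{TT})^{-1}\|_\infty = 0.2,

\Sigma_{T^c|T}
  = \begin{bmatrix} 1 & 0.2 \\ 0.2 & 1 \end{bmatrix}
  - \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}
    \begin{bmatrix} 0.2 & 0.2 \end{bmatrix}
  = \begin{bmatrix} 0.96 & 0.16 \\ 0.16 & 0.96 \end{bmatrix},

\rho_u = 0.96, \qquad
\rho_l = \tfrac12\,(0.96 + 0.96 - 2 \cdot 0.16) = 0.8 .
```

In this example $\Sigma_{TT} = 1$, so $C_{\min} = C_{\max} = D^+_{\max} = D^-_{\max} = 1$, and a mutual incoherence bound of the form $0.2 \le \frac{1}{2}(1 - \gamma)$ holds for every $\gamma \le 0.6$.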


We establish the following result for a Gaussian random design whose covariance matrix $\Sigma$ obeys the
two assumptions.

Theorem 2. (Achievability) Given the linear model (3) with random Gaussian design whose
covariance matrix $\Sigma$ satisfies the invertibility and incoherence properties for any $\gamma \in (0, 1)$, suppose we
solve the extended Lasso (4) with regularization parameters obeying

$$\lambda_\beta = \frac{4}{\gamma}\sqrt{\max\{\rho_u, D^+_{\max}\}\,n\sigma^2 \log p} \quad \text{and} \quad \lambda_e = 8\sqrt{\sigma^2 \log n}. \qquad (16)$$

Also, let $\eta = \frac{1}{32\gamma^2 \log n}$ and suppose the sequence $(n, p, k, s)$ and the regularization parameters $\lambda_\beta$, $\lambda_e$ satisfy
$s \le \eta n$ and

n  >  max  /ci  1  -^-k\og(p  -  k),C2  ,  V  log(p  -  fc)  logn}  , 

l  (1  -  V)  T'min  (1  -  V)  6min  J 

(17) 

where $C_1$ and $C_2$ are numerical constants. In addition, suppose that $\min_{i \in T} |\beta^*_i| > f_\beta(\lambda_\beta)$ and
$\min_{i \in S} |e^*_i| > f_e(\lambda_\beta, \lambda_e)$, where


f/3  ■=  Cl 


n  —  s 


A/3  k\og(p-k) 


1-1/2 

JTT 


20< 


a2  log  k 


Cmin(n  -  s ) 


fe  :=  c2(C'max(fcv/i  +  sVk.))1/2 J k ^ 

n  —  s  V  n 


1-1/2 

-‘TT 


and 


+  c3Ae. 


(18) 


(19) 


Then the following properties hold with probability greater than $1 - c\exp(-c' \max\{\log n, \log pk\})$:

1. The solution pair $(\hat\beta, \hat e)$ of the extended Lasso (4) is unique and has exact signed support.

2. $\ell_\infty$-norm bounds: $\|\hat\beta - \beta^*\|_\infty \le f_\beta(\lambda_\beta)$ and $\|\hat e - e^*\|_\infty \le f_e(\lambda_\beta, \lambda_e)$.


There are several interesting observations from the theorem:

1) The first and most important observation is that the extended Lasso is robust to arbitrarily large, sparse
observation errors. In that sense, the extended Lasso can be viewed as a generalization of the Lasso.
Under the same invertibility and mutual incoherence assumptions on the covariance matrix $\Sigma$ as
the standard Lasso, the extended Lasso program can recover both the regression vector and the error
with exact signed supports even when almost all the observations are contaminated by arbitrarily
large errors with unknown support. What we sacrifice for robustness to corruption is an additional
log factor in the number of samples. We notice that when the number of corrupted observations is $O(n/\log n)$, only
$O(k \log(p - k))$ samples are sufficient to recover the exact signed supports of both the regression and
sparse error vectors.


2) We consider the special case of a Gaussian random design in which the covariance matrix is $\Sigma =
\frac{1}{n} I_{p \times p}$. In this case, the entries of $X$ are i.i.d. $\mathcal{N}(0, 1/n)$ and we have the quantities $C_{\min} = C_{\max} =
D^+_{\max} = D^-_{\max} = \rho_u = \rho_l = 1$. In addition, the invertibility and mutual incoherence properties
are automatically satisfied. The theorem implies that when the number of errors $s$ is close to $n$,
the number of samples $n$ needed to recover the exact signed supports satisfies $\frac{n}{\log n} = \Omega(k \log(p -
k))$. Furthermore, Theorem 2 guarantees consistency in the element-wise $\ell_\infty$-norm of the estimated
regression vector, $\|\hat\beta - \beta^*\|_\infty$.

When $\gamma$ is chosen to be $1/\sqrt{32 \log n}$ (which corresponds to allowing $s$ close to $n$), the $\ell_\infty$ error rate is of order
$O(\sigma\sqrt{\log p})$, which is known to be the same as that of the standard Lasso.
3) Corollary 1, though interesting, is not able to guarantee stable recovery when the fraction of
corruption converges to one. We show in Theorem 2 that this fraction can come arbitrarily close
to one at the cost of a factor of $\log n$ in the number of samples. Theorem 2 also implies that
there is a significant difference between recovery with small parameter estimation error and
recovery with correct variable selection. When the number of corrupted observations is linearly
proportional to $n$, recovering the exact signed supports requires an increase from $\Omega(k \log p)$ samples (in
Corollary 1) to $\Omega(k \log p \log n)$ samples (in Theorem 2). This behavior is captured similarly by the
standard Lasso, as pointed out in [17], Corollary 2.


Our next theorem shows that the number of samples needed to recover the accurate signed support is
optimal. That is, whenever the rescaled sample size satisfies (20), then for whatever regularization
parameters $\lambda_\beta$ and $\lambda_e$ are selected, no solution of the extended Lasso correctly identifies the signed
supports with high probability.

Theorem 3. (Inachievability) Given the linear model (3) with random Gaussian design whose
covariance matrix $\Sigma$ satisfies the invertibility and incoherence properties for any $\gamma \in (0, 1)$, let $\eta =
\frac{1}{32\gamma^2 \log(n - s)}$ and suppose the sequence $(n, p, k, s)$ satisfies $s \ge \eta n$ and


n  <  min  <  C-i 


(1  -  V)  Ca 


-klog(p-  k),C4 


V  min{pi,.D+ax} 


(1  -rif 


Cn 


k  log (p  —  k )  log(l  —  rj)n  1 


\Jer2  log 


n 


(20) 

where $C_3$ and $C_4$ are some small universal constants. Then with probability tending to one, no
solution pair of the extended Lasso (4) has the correct signed support.


3  Illustrative  simulations 

In this section, we provide some simulations to illustrate the ability of the extended Lasso to
recover the exact signed support of the regression vector when a significant fraction of the observations is
corrupted by large errors. Simulations are performed for a range of parameters $(n, p, k, s)$, where the
design matrix $X$ is a uniform Gaussian random matrix whose rows are i.i.d. $\mathcal{N}(0, I_{p \times p})$. For each fixed set of
$(n, p, k, s)$, we generate sparse vectors $\beta^*$ and $e^*$ whose nonzero entries have uniformly
random locations and Gaussian distributed magnitudes.

In our experiments, we consider varying problem sizes $p \in \{128, 256, 512\}$ and three types of
regression sparsity indices: sublinear sparsity ($k = 0.2p/\log(0.2p)$), linear sparsity ($k = 0.1p$), and
fractional power sparsity ($k = 0.5p^{0.75}$). In all cases, we fix the error support size at $s = n/2$,
which means that half of the observations are corrupted. With this selection, Theorem 2 suggests that the number
of samples should satisfy $n \ge 2Ck \log(p - k) \log n$ to guarantee exact signed support recovery. We choose
$\frac{n}{\log n} = 4\theta k \log(p - k)$, where the parameter $\theta$ is the rescaled sample size. This parameter controls the
success or failure of the extended Lasso.

In the algorithm, we select $\lambda_\beta = 2\sqrt{\sigma^2 \log p \log n}$ and $\lambda_e = 2\sqrt{\sigma^2 \log n}$, as suggested by Theorem
2, where the noise level $\sigma = 0.1$ is fixed. The algorithm reports a success if the solution pair has
the same signed support as $(\beta^*, e^*)$. In Fig. 1, each point on the curve represents the average of 100
trials.

As demonstrated by the simulations, our extended Lasso is able to recover the exact signed support
of both $\beta^*$ and $e^*$ even when 50% of the observations are contaminated. Furthermore, up to unknown
constants, our Theorems 2 and 3 match the simulation results. When the sample size satisfies $\frac{n}{\log n} \le 2k \log(p -
k)$, the probability of success starts going to zero, implying the failure of the extended Lasso.
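
The experimental protocol described in this section can be sketched in a few lines. The code below is a hypothetical harness, not the authors' implementation: it generates one problem instance as described (Gaussian design, uniformly random supports, Gaussian magnitudes), defines the success criterion as equality of signed supports, and averages successes over trials for any solver passed in as a function. The unit variance of the nonzero magnitudes and the tolerance used to decide that an estimated entry is zero are assumptions.

```python
import random


def generate_problem(n, p, k, s, sigma, rng):
    """Draw (X, y, beta_star, e_star) for one trial."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    beta_star = [0.0] * p
    for j in rng.sample(range(p), k):        # uniformly random support
        beta_star[j] = rng.gauss(0.0, 1.0)   # ASSUMED unit-variance magnitude
    e_star = [0.0] * n
    for i in rng.sample(range(n), s):
        e_star[i] = rng.gauss(0.0, 1.0)
    y = [sum(X[i][j] * beta_star[j] for j in range(p)) + e_star[i]
         + rng.gauss(0.0, sigma) for i in range(n)]
    return X, y, beta_star, e_star


def sign(v, tol=0.0):
    """Return -1, 0, or +1, treating |v| <= tol as zero."""
    return 0 if abs(v) <= tol else (1 if v > 0 else -1)


def same_signed_support(estimate, truth, tol=1e-8):
    """Success criterion: identical sign pattern in every coordinate."""
    return all(sign(a, tol) == sign(b) for a, b in zip(estimate, truth))


def success_rate(solver, n, p, k, s, sigma, trials, seed=0):
    """Fraction of trials in which solver(X, y) recovers both signed supports."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        X, y, beta_star, e_star = generate_problem(n, p, k, s, sigma, rng)
        beta_hat, e_hat = solver(X, y)
        wins += (same_signed_support(beta_hat, beta_star)
                 and same_signed_support(e_hat, e_star))
    return wins / trials
```

Any routine that maps `(X, y)` to a pair of estimates can be plugged in as `solver`, with the regularization parameters fixed inside it.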